
common/parser: add proper reasoning tag prefill reading#20424

Merged
pwilkin merged 14 commits into ggml-org:master from pwilkin:reasoning-prefill on Mar 19, 2026

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 11, 2026

This fixes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As people rightly pointed out, some models have dynamic or hybrid reasoning: they can reason or not depending on switches, and even the template behavior can change because of this (e.g. inserting <think> in the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone. The parser now just detects models with tagged reasoning, i.e. an opening and a closing reasoning marker (DELIMITER was also deleted, since it is just the special case where the opening marker is empty). However, the parser checks the assistant prefill for those markers and appends them to the input for the grammar and the parser so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the <think> was added by the template or generated by the model.

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Fixes #20356
Fixes #20325
Fixes #20265

This also clears the ground for disabling grammar triggers inside reasoning loops in a subsequent PR, which would resolve #20260

@github-actions github-actions bot added the documentation, testing, examples and server labels on Mar 11, 2026
@aldehir
Contributor

aldehir commented Mar 11, 2026

Dumb question, why not find the start of the assistant message and prepend that?

I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route, I would look into prepending the start of the entire assistant message. This would also open the door to parsing output from requests with an assistant prefill.

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Yeah, that would be the logical conclusion, but for now it's easier for me to just extract the reasoning markers, since finding the actual start of the assistant message is nontrivial.

@aldehir
Contributor

aldehir commented Mar 11, 2026

Qwen3.5 uses <think>\n\n</think>\n\n when thinking is disabled:

{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- else %}
{{- '<think>\n' }}
{%- endif %}

however,

      "reasoning_prefill": "<think></think>\n\n",

It probably doesn't matter for this model, but it is technically not adhering to the template.

@aldehir
Contributor

aldehir commented Mar 11, 2026

    {
      "id": 248045,
      "piece": "<|im_start|>"
    },
    {
      "id": 74455,
      "piece": "assistant"
    },
    {
      "id": 198,
      "piece": "\n"
    },
    {
      "id": 248068,
      "piece": "<think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    },
    {
      "id": 248069,
      "piece": "</think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    }

Maybe set reasoning_prefill from the start of the opening tag to the end of the prompt?

@aldehir
Contributor

aldehir commented Mar 11, 2026

finding the actual start of the assistant message is nontrivial.

Run the template once with add_generation_prompt = false, capture the size, run again with true, extract the string content that spans the delta? I think that would work in most cases.

@pwilkin
Contributor Author

pwilkin commented Mar 12, 2026

That usually works, yeah 😀 I can try that and see what the results are (this is what calculate_diff_split from the analyzer does, BTW). I'm just worried about some weird edge cases.

@bsdice

bsdice commented Mar 14, 2026

Nice patch! With the model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF, this patch fixes the webui getting confused on /think and failing to split the reasoning and generation parts correctly. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1.

@pwilkin pwilkin force-pushed the reasoning-prefill branch from 3bfb08f to 4083259 Compare March 14, 2026 14:49
@pwilkin
Contributor Author

pwilkin commented Mar 14, 2026

@aldehir changed the prefill extraction behavior to the differential one you mentioned.

common/chat.h Outdated
std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string prefill;
Contributor

Should we name this generation_prompt? It lines up with the add_generation_prompt flag.

Comment on lines +71 to +95
bool clear_reasoning_start = false;
if (inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE &&
    autoparser.reasoning.mode != reasoning_mode::NONE &&
    !autoparser.reasoning.end.empty()) {
    const auto & r_start = autoparser.reasoning.start;
    const auto & r_end   = autoparser.reasoning.end;
    auto r_end_t   = trim_trailing_whitespace(r_end);
    auto r_start_t = trim_trailing_whitespace(r_start);

    if (!r_start_t.empty()) {
        auto start_pos = prompt_to_search.rfind(r_start_t);
        if (start_pos != std::string::npos) {
            std::string from_start = prompt_to_search.substr(start_pos);
            auto fs_trimmed = trim_trailing_whitespace(from_start);

            if (string_ends_with(fs_trimmed, r_end_t)) {
                data.prefill = r_start + r_end;
            } else if (string_ends_with(fs_trimmed, r_start_t)) {
                data.prefill = from_start;
            } else {
                clear_reasoning_start = true;
            }
        }
    }
}
Contributor

@aldehir aldehir Mar 14, 2026

So my understanding is: we have a generation prompt G, and we can create a parser that accepts G[0:min(G.size(), G.index_of(reasoning_start))] + (reasoning_start + reasoning + reasoning_end)? + .... Then we can do away with all the trim logic.

The benefit is that now the parser can properly parse assistant prefill from the user, since the parser starts from the beginning of the assistant message.

I see that Mistral's templates have no generation prompt, so G = "". But this is fine, because the model emits the [THINK] tag. So the above still works.

Contributor Author

This workaround is mostly for Apriel, which has a delimited thinking format and inserts a header like "Thinking chain starts here: " as the generation prompt, which acts as a quasi-reasoning marker that we want to strip.

@pwilkin
Contributor Author

pwilkin commented Mar 15, 2026

@aldehir okay, that rewrite ended up being a bit bigger than I expected... but it's exactly the algorithm you mentioned now.

@aldehir
Contributor

aldehir commented Mar 15, 2026

Oh jeez, well it's <100 net LOC. I'll give it a whirl.

@pwilkin pwilkin requested review from a team as code owners March 15, 2026 15:02
@pwilkin
Contributor Author

pwilkin commented Mar 15, 2026

@aldehir happy to report I added another nice piece of code to make it work correctly with grammars / schemas :)

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 19, 2026

With OpenCode and Unsloth's Qwen3.5 35B, I now seem to be getting all think blocks in the response, with </think> attached.

(screenshot)

This is with "show thinking" off.

Wasn't happening yesterday, so I'm guessing it's related?

@pwilkin
Contributor Author

pwilkin commented Mar 19, 2026

@strawberrymelonpanda argh.

Can you check with the vanilla template?

@strawberrymelonpanda
Contributor

Willing to; is there a command I can use to bypass Unsloth's template, or do I need a different model?

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 19, 2026

@pwilkin looks like it's happening on Ubergarm's quant too.

(screenshot)

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 19, 2026

@pwilkin I rolled back to the commit right before this PR, and I no longer see the thinking content and tags.

With show thinking turned ON at commit c125883, using Ubergarm's quant:

(screenshot; turned on since otherwise there'd be nothing to show it's working)

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda can you give the exact server command?

I'm trying to repro on various Qwen3.5 4B quants but all seems correct so far.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 20, 2026

GGML_CUDA_GRAPH_OPT=1 \
LLAMA_SERVER_SLOTS_DEBUG=1 \
llama-server \
--seed 1 \
--threads 8 \
--host 127.0.0.1 \
--port <port> \
--props \
--no-mmap \
--direct-io \
--models-preset ./presets.ini \
--models-max 1 \
[*]
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 48
draft-max = 64
parallel = 1
flash-attn = on
fit = on

; Qwen
[qwen35-35b]
load-on-startup = true
model = <path>/Qwen3.5-35B-A3B-Q4_K_S (unsloth).gguf
;model = <path>/Qwen3.5-35B-A3B-Q4_0 (ubergarm).gguf
mmproj = <path>/Qwen3.5-35B-A3B-mmproj-bf16.gguf
fit-target = 1800
ctx-size = 80000
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.00

(Tested with both models)
OpenCode version 1.2.26

Start opencode in the llama.cpp folder, at commit c1b9116 (master).

I ran the command

Add a sleep API endpoint that triggers the result of sleep_idle_seconds

(I was seeing if it could, for local use, because I'd like this without a full unload; I think there are some differences?)

For this specific command, I first see thinking work twice:

(two screenshots)

Then it starts failing:

(two screenshots)

I can't promise it's related, but my OpenCode is set to manual update and hasn't changed, and after rolling back the </think> tags stopped entirely.


The same commands on c125883

(three screenshots)

and continues on throughout the task without issue.

pwilkin added a commit that referenced this pull request Mar 20, 2026
…on) (#20777)

* chat : fix out_of_range crash in throw path (#20424 regression)

#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 20, 2026

@pwilkin I'm using a deterministic seed, --seed 1, so with any luck maybe you can reproduce it.

Not sure how much the RNG varies based on hardware though.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

Can you please try without speculative decoding as well? (Got to go to sleep now, but I'll try to repro tomorrow.)

@strawberrymelonpanda
Contributor

Same results without

spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 48
draft-max = 64

For that command, works twice, fails twice, etc.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 20, 2026

If you're not familiar with OpenCode, here's an opencode.json config file that should work with llama.cpp. It's a stripped-down version of my own; I don't think I removed anything important, just some permissions and extra models.

opencode.json
{
  "$schema": "https://opencode.ai/config.json",  

  "share": "disabled",
  "autoupdate": false,
  "enabled_providers": ["llama-cpp"],
  
  "formatter": false,
  "lsp": false,  

  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 10000
  },

  "permission": {
    "webfetch": "ask",
    "websearch": "ask",
    "bash": {
      "*": "ask"
    }
  },

  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:<port>/v1"
      },
      "models": {
        "qwen35-35b": {
          "name": "qwen35-35b",
          "modalities": {
            "input": [ "text", "image" ],
            "output": [ "text" ]
          },
          "limit": {
            "context": 80128,
            "output": 99999
          }          
        }
      }
    }
  },

  "model": "llama-cpp/qwen35-35b",
  "small_model": "llama-cpp/dummy"
}

"small_model": "llama-cpp/dummy" is there because of a recent issue where, unless a small model was set, OpenCode would send data to their servers to get session titles from GPT Nano. Giving it a dummy model just makes it fail and fall back to a timestamp.

You can remove this line, but as of recent commits it'll cause the large model to try to make the title instead, which can be a significant delay, so I keep it for a fast fail.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda I test everything on OpenCode, which is why I'm surprised to see this arise. I'll try to repro on the smaller models first, but I have a 35B quant lying around somewhere if I can't.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda got a repro, looking into it.

@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

Aaaaand it's gone... reproduced it once, set up an MITM proxy, and now it's gone :P

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* Implement proper prefill extraction

* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp

* Update tools/server/server-task.cpp

* refactor: move grammars to variant, remove grammar_external, handle exception internally

* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
…regression) (ggml-org#20777)

* chat : fix out_of_range crash in throw path (ggml-org#20424 regression)

ggml-org#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
@pwilkin
Contributor Author

pwilkin commented Mar 20, 2026

@strawberrymelonpanda happy to report I found the cause :)

Can you please check if #20825 resolves it?

@strawberrymelonpanda
Contributor

Looks good. 👍

(screenshot)


Labels

documentation (Improvements or additions to documentation), examples, python (python script changes), server, testing (Everything test related)


7 participants